Code
library(tidyverse)
library(broom)
library(ggridges)
library(RColorBrewer)
library(knitr)
library(DT)
library(kableExtra)
library(gganimate)
library(gifski)
library(patchwork)Investigating the Relationship Between Income and Sanitation
library(tidyverse)
library(broom)
library(ggridges)
library(RColorBrewer)
library(knitr)
library(DT)
library(kableExtra)
library(gganimate)
library(gifski)
library(patchwork)sanitation.data <- read_csv("at_least_basic_sanitation_overall_access_percent.csv")
income.data <- read_csv("income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
colnames(income.data) <- c("country", as.numeric(1799:2049))We are interested in the relationship between income and sanitation levels. One might assume that as income per person increases, so too will the access to basic necessities, including hygienic products and facilities, such as bathrooms and showers. We aim to analyze data relating to these factors, outlined in the process below.
income.data |>
slice_head(n = 1000) |>
select(country, `1999`:`2002`) |>
datatable(class = "cell-border",
rownames = FALSE,
caption = "Table 1: Preview of Income Data Set, 1999-2002.",
filter = "top")Our income data set measures the total GDP of a country per person for 216 countries between the years of 1892 and 2049, with units in fixed 2017 prices. This number is adjusted for PPP, or the differences between costs of living and essential products, across countries. The data comes from the World Bank, the Maddison Project Database, and the Penn World Table. Historical estimates were used for early years; forecasts from the IMF’s Economic Outlook were used to project income in the future.
sanitation.data |>
slice_head(n = 1000) |>
select(1:5) |>
datatable(class = "cell-border",
rownames = FALSE,
caption = "Table 2: Preview of Sanitation Data Set, 1999-2002.",
filter = "top")Our sanitation data set measures the percentage of people (living in both urban and rural settings) who use at least basic sanitation services not shared with other households. This includes flushing and pouring to piped sewer systems, septic, and ventilation for pit latrines, such as squat toilets.
The data was collected by the World Health Organization and UNICEF, falling between the years of 1999 to 2019.
We hypothesize that there will be a positive relation between income per person and percentage of people with basic sanitation at their disposal. We would like to be able to predict a country’s percentage of people who have access to basic sanitation services based on that country’s income per person.
We continue our analysis by merging these two data sets to focus on this hypothesized relationship between 1999 and 2019.
The cleaning process requires removing instances of “k” from some data entries, representing thousands of dollars. For example, we had to perform the following conversion: \(12.3k \to 12300\) for such entries. Notice that the decision to drop NA values was made, since ggplot and lm will drop NA values and we have an abundance of data entries. This decision will be more convenient later down the road when taking advantage of the country column.
convert_num <- function(x){ #converts "12.3k" to 12300,
temp <- x
if(str_detect(x, "k$")){
temp <- x |>
str_replace_all("[^[:digit:]]", "") |> #gross but it just replaces any non-number with ""
as.numeric() * 100
}
return(temp)
}
#data cleaning + joining
sanitation.clean <- sanitation.data |>
drop_na() |>
pivot_longer(cols = `1999`:`2019`,
names_to = "year",
values_to = "sanitation")
income.clean <- income.data |>
drop_na() |> #pretty bold step to remove any rows with na vals,
#but due to the abundance of data seems reasonable
select(c(country, `1999`:`2019`)) |>
pivot_longer(cols = `1999`:`2019`,
names_to = "year",
values_to = "income") |>
mutate(income = as.numeric(map(income, convert_num)))#converts characters,
#able to convert after pivot, b/c data is all seen as strings!Below is the final, merged, and cleaned data set.
full.data <- inner_join(sanitation.clean, income.clean) |>
mutate(year = as.numeric(year))
full.data |>
slice_head(n = 1000) |>
datatable(class = "cell-border",
rownames = FALSE,
caption = "Table 3: Preview of Final Data Set.",
filter = "top")We aim to piece together the relationship between income and sanitation levels with a linear model. Our goal is to develop a model that predicts the percent of people with basic sanitation (response) from adjusted income per person (explanatory).
We first visualize the relation between these two variables.
data.plot <- full.data |>
ggplot(mapping = aes(x = income, y = sanitation) ) +
geom_point(alpha = 0.5) +
scale_y_continuous(limits = c(0,100),
breaks = seq(0,100,25)) +
labs(title = "Income versus Sanitation, 1999-2019",
subtitle = "Percentage of People with Basic Sanitation Services",
x = "Adjusted Income per Person (in 2017 $)",
y = "")
data.plotAlthough the above scatter plot is very messy, we can see a general pattern: countries with low adjusted income per person have lower percentages of people with basic sanitation, whereas countries with high income per person rarely have less than 90 percent of people with basic sanitation.
us.plot <- full.data |>
mutate(year = as.numeric(year)) |>
filter(country == "United States") |>
ggplot(mapping = aes(x = income, y = sanitation)) +
geom_point() +
transition_time(year) +
labs(title = "United States Income vs. Sanitation, 1999-2019",
subtitle = "Percentage of People with Basic Santitation",
x = "Adjusted Income Per Person (in 2017 $)",
y = "",
tag = "Year: {floor(frame_time)}") +
shadow_mark(alpha = 1,
size = 1)
w = animate(us.plot, duration = 5, fps = 20, renderer = gifski_renderer(), end_pause = 40)
anim_save("animated_US_plot.gif", animation = w)
cuba.plot <- full.data |>
mutate(year = as.numeric(year)) |>
filter(country == "Cuba") |>
ggplot(mapping = aes(x = income, y = sanitation)) +
geom_point() +
transition_time(year) +
labs(title = "Cuba Income vs. Sanitation, 1999-2019",
subtitle = "Percentage of People with Basic Santitation",
x = "Adjusted Income Per Person (in 2017 $)",
y = "",
tag = "Year: {floor(frame_time)}") +
shadow_mark(alpha = 1,
size = 1)
z = animate(cuba.plot, duration = 5, fps = 20, renderer = gifski_renderer(), end_pause = 40)
anim_save("animated_Cuba_plot.gif", animation = z)However, there are instances where countries with very low income per person have high percentages of people with basic sanitation. For example, Cuba has a relatively low adjusted income per person in the years between 1999 and 2019, but still enjoyed a sanitation percentage above 88 percent; this continued to rise as income per person increased.
Another exception to the rule has some countries drop to low income per person while maintaining similar sanitation levels. There was a period in time when the United States’ adjusted income per person dropped below $20,000; yet the percentage of people with basic sanitation never drops below 99 percent within this time frame. Despite an obvious general trend, further investigation is needed to stake a stronger claim.
Our next step is to look at how the relationship between our variables of interest changed over time.
temp.plot <- full.data |>
mutate(year = as.numeric(year)) |>
ggplot(mapping = aes(x = income,
y = sanitation)) +
geom_point(show.legend = F) +
transition_time(year) +
labs(title = "Income versus Sanitation, 1999-2019",
subtitle = "Percentage of People with Basic Sanitation Services",
x = "Adjusted Income per Person (in 2017 $)",
y = "",
tag = "Year: {floor(frame_time)}") +
shadow_mark(keep_layer = T, size = 0.1, alpha = 0.5) +
enter_fade() +
exit_fade()
anim_time = animate(temp.plot, duration = 10, fps = 10, renderer = gifski_renderer())
anim_save("animated_time_plot.gif", animation = anim_time)The plot above, specifically the shadow trails on each point, tells us that almost every data point experiences an increase in percentage of people with basic sanitation services, coupled with overall increases in adjusted income per person. This gives us more confidence that our hypothesized relationship between our variables is correct, but more in depth analysis is needed before we can conclude anything.
Upon fitting a simple linear regression model for percent of people who have access to basic sanitation (\(Y\)) based on adjusted income per person (\(X\)), we acquired the following regression equation:
\[\hat{Y} = 55.75 + 0.0009699X\]
data.plot + geom_smooth(method = "lm", color = "red")Our simple linear regression equation tells us that a country with an average adjusted income of $0 will be predicted to have a 55.75 percent of its people with access to basic sanitation services. Note that this conclusion is an extrapolation since our data does not contain any information about any countries with a $0 adjusted income per person. Our equation also tells us for every ten thousand dollar increase in adjusted income per person, a country’s average sanitation percentage increases by 9.699 points.
To assess the validity of our linear model, we look at the variance in observed sanitation percentage, predicted sanitation percentage, and residuals.
model_var <- augment(data.fit) |>
summarise(var.resp = var(sanitation),
var.fitted = var(.fitted),
var.resid =var(.resid))
model_var |>
kbl(
col.names = c("Variance in Sanitation", "Variance in Predicted", "Variance in Residuals"),
caption = "Table 4: Variance in Observed/Predicted/Residual Sanitation."
) |>
kable_styling(bootstrap_options = "bordered",
position = "center",
font_size = 18,
html_font = "helvetica"
)| Variance in Sanitation | Variance in Predicted | Variance in Residuals |
|---|---|---|
| 942.5548 | 313.8511 | 628.7037 |
The proportion of variability accounted for by our model was 0.3329791. Our model does a good enough job to communicate the general idea that the more money people are making, the make likely it is that they will have access to basic sanitation services. However, the variable we are hoping to predict is a percentage which means it can not exceed 100%, exposing the flaw that our model will predict impossible to reach sanitation percentages for high enough average income inputs. This gives us problems regarding what we can and can not conclude due to risk of extrapolating further than our data provides. There is not much we can do about this using a simple linear model as we would have to introduce a polynomial regression equation (say, through a logarithmic transformation) to be more precise.
To further investigate the quality of our model, we created a simulated data set using our regression model and the actual adjusted income amounts as inputs. Below are plots comparing the observed data and our simulated data.
data_predict <- predict(data.fit) #predicted values
data_sig <- sigma(data.fit) #s or residual std error
noise <- function(x, mean = 0, sd){ #stolen from Prof Robinson :,)
x + rnorm(length(x),
mean,
sd)
}
sim_response <- tibble(sanitation = noise(data_predict,
sd = data_sig)
)full.dist <- full.data |>
ggplot(aes(x = sanitation)) +
geom_histogram() +
labs(x = "Observed Sanitation Index",
y = "",
subtitle = "Count") +
theme_bw()
sim.dist <- sim_response |>
ggplot(aes(x = sanitation)) +
geom_histogram() +
labs(x = "Simulated Sanitation Index",
y = "",
subtitle = "Count") +
theme_bw()
full.dist + sim.distAbove are histograms of basic sanitation percentage in the observed data and our simulated data, which highlight the poor quality of our model. While our observed data is skewed to the left, the simulated data follows a normal distribution. Furthermore, they highlight the weakness that our model outputs sanitation percentages that are more than 100 percent, which make some conclusions meaningless.
sim.plot <- sim_response |>
ggplot(mapping = aes(x = full.data$income, y = sanitation) ) +
geom_point(alpha = 0.5) +
scale_y_continuous(limits = c(0,200),
breaks = seq(0,200,25)) +
labs(title = "Predicted Sanitation versus Income, 1999-2019",
subtitle = "Percentage of People with Basic Sanitation Services",
x = "Adjusted Income per Person (in 2017 $)",
y = "")
data.plot2 <- full.data |>
ggplot(mapping = aes(x = income, y = sanitation) ) +
geom_point(alpha = 0.5) +
scale_y_continuous(limits = c(0,200),
breaks = seq(0,200,25)) +
labs(title = "Income versus Sanitation, 1999-2019",
subtitle = "Percentage of People with Basic Sanitation Services",
x = "Adjusted Income per Person (in 2017 $)",
y = "")
sim.plot + data.plot2Comparing plots of percentage of people with basic sanitation and adjusted income per person for the observed and simulated data provides us more evidence of the flaws of our model. Even though the simulated data communicates the general idea of the observed data, it seems that our model oversimplifies the relationship between our variables.
sim_data <- full.data |>
select(sanitation) |> #dont have to filter since we threw out na values in beggining
bind_cols(sim_response)
sim_data |>
ggplot(aes(x = sanitation...1,
y = sanitation...2)
) +
geom_point(alpha = 0.4) +
labs(x = "Observed Sanitation",
y = "",
subtitle = "Simulated Sanitation" ) +
geom_abline(slope = 1,
intercept = 0,
color = "steelblue",
linetype = "dashed",
lwd = 1.5) +
theme_bw()Overall, it appears that there are many over estimates for low observed sanitation and a range of estimates at large observed sanitation, with majority being under. It appears there is a “weak” relationship between the observed values and simulated values, as the points are far from the linear. However based on previous stat knowledge it does not appear that a linear model is the most appropriate here and a transformation may be due; however, we will explore this later in the article.
nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
.f = ~ tibble(sim = noise(data_predict,
sd = data_sig)
)
)
colnames(sims) <- colnames(sims) |>
str_replace(pattern = "\\.\\.\\.",
replace = "_")
temp <- full.data |>
select(sanitation) |>
bind_cols(sims)
sim_r_sq <- temp |>
map(~ lm(sanitation ~ .x, data = temp)) |>
map(glance) |>
map_dbl(~ .x$r.squared) #removes first col
tibble(sims = sim_r_sq[-1]) |>
ggplot(aes(x = sims)) +
geom_histogram(binwidth = 0.0025) +
labs(titel = "Distribution of $R^2$",
x = expression("Simulated"~ R^2),
y = "",
subtitle = "Number of Simulated Models")normean = mean(sim_r_sq[-1])
#.319 in recent runIn this plot, we see that the simulated datasets have \(R^2\) values between 0.08 and 0.15. This indicates the data simulated under this statistical model is weakly similar to what was observed. On average, our simulated data account for about 11% of the variability in the observed sanitation index. This seems to be the final nail in the coffin for us to be able to say for sure that this model is not good.
We wanted to attempt a log transformation of adjusted income per person in order to improve our model. Upon doing this, we received the following regression equation.
temp.data <- full.data
temp.data$income <- log(temp.data$income, base = 2) #transforming by log base 2
temp.fit <- lm(sanitation~income, data = temp.data) #fitting new model\[\hat{Y} = -102.2536 + 13.32\text{log}_2(X)\]
temp.data |>
ggplot(mapping = aes(x = income, y = sanitation) ) +
geom_point() +
geom_smooth(method = lm, level = 0, color = "red") +
scale_y_continuous(limits = c(0,100),
breaks = seq(0,100,25)) +
labs(title = "Log Transformed Income versus Sanitation, 1999-2049",
subtitle = "Percentage of People with Basic Sanitation Services",
x = "Log Adjusted Income per Person (in 2017 $)",
y = "")If we double adjusted income per person, our new model predicts a 13.31 percent increase in the mean value of percentage of people with access to basic sanitation. The proportion of variability accounted for by this model was 0.5651. This is much better than the non transformed model, but still not great. This model still outputs sanitation percentages over 100 percent, but does it much less often. The data does seem to curve, so the introduction of a polynomial regression equation might improve the model; but for our purposes, this is sufficient.
data_predict <- predict(temp.fit) #predicted values
data_sig <- sigma(temp.fit) #s or residual std error
sim_response_log <- tibble(sanitation = noise(data_predict,
sd = data_sig)
)We decided another simulation was needed on this model to see how much better it was compared to our previous one.
temp.dist.log <- temp.data |>
ggplot(aes(x = sanitation)) +
geom_histogram() +
labs(x = "Observed Sanitation Index (log)",
y = "",
subtitle = "Count") +
theme_bw()
sim.dist.log <- sim_response_log |>
ggplot(aes(x = sanitation)) +
geom_histogram() +
labs(x = "Simulated Sanitation Index (log)",
y = "",
subtitle = "Count") +
theme_bw()
temp.dist.log + sim.dist.logsim.dist.log + sim.distOverlall, we see that our simulated sanitation index with the log transformation is similarly centered as our non-transformed model. However, our transformed distribution is more stout, with a slight reduction in range and less of a tendency to have values around the mean.
sim.plot.log <- sim_response_log |>
ggplot(mapping = aes(x = temp.data$income, y = sanitation) ) +
geom_point(alpha = 0.5) +
scale_y_continuous(limits = c(0,200),
breaks = seq(0,200,25)) +
labs(title = "Predicted Sanitation versus Log Income, 1999-2019",
subtitle = "Percentage of People with Basic Sanitation Services",
x = "Adjusted Income per Person (in 2017 $ log)",
y = "")
data.plot2 <- temp.data |>
ggplot(mapping = aes(x = income, y = sanitation) ) +
geom_point(alpha = 0.5) +
scale_y_continuous(limits = c(0,200),
breaks = seq(0,200,25)) +
labs(title = "Log Income versus Sanitation, 1999-2019",
subtitle = "Percentage of People with Basic Sanitation Services",
x = "Adjusted Income per Person (in 2017 $ log)",
y = "")
sim.plot.log + data.plot2We still have dramatic differences in the distribution and shape of these plots, however, the general trend between these two is much more similar compared to our non-transformed.
sim_data <- temp.data |>
select(sanitation) |> #dont have to filter since we threw out na values in beggining
bind_cols(sim_response_log)
sim_data |>
ggplot(aes(x = sanitation...1,
y = sanitation...2)
) +
geom_point(alpha = 0.4) +
labs(x = "Observed Sanitation",
y = "",
subtitle = "Simulated Sanitation" ) +
geom_abline(slope = 1,
intercept = 0,
color = "steelblue",
linetype = "dashed",
lwd = 1.5) +
theme_bw()It seems that when comparing the simulated to observed, we still tend to over predict at lower sanitation instances and generally under predict when the observed sanitation is higher or near 100. Next, we can visualize how this relationship behaves over the long runs with many simulated sanitation values under a similar process to before.
nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
.f = ~ tibble(sim = noise(data_predict,
sd = data_sig)
)
)
colnames(sims) <- colnames(sims) |>
str_replace(pattern = "\\.\\.\\.",
replace = "_")
temp <- temp.data |>
select(sanitation) |>
bind_cols(sims)
sim_r_sq <- temp |>
map(~ lm(sanitation ~ .x, data = temp)) |>
map(glance) |>
map_dbl(~ .x$r.squared)#removes first col
tibble(sims = sim_r_sq[-1]) |>
ggplot(aes(x = sims)) +
geom_histogram(binwidth = 0.0025) +
labs(x = expression("Simulated"~ R^2),
y = "",
subtitle = "Number of Simulated Models")logmean = mean(sim_r_sq[-1])In this plot, we see that the simulated data sets have \(R^2\) values between 0.3 and 0.34. This indicates the data simulated under this statistical model is weakly similar to what was observed. On average, our simulated data account for about 32% of the variability in the observed sanitation index with a transformed log model.
Income Data Set. Collected by World Bank, Maddison Project, Penn World Table (University of Groningen).
Sanitation Data Set. Collected by World Health Organization, UNICEF.